Motivation and use cases



Additional Prerequisites

If you are interested in the theory section of this post, learning the theory behind Naive Bayes first is necessary to understand this topic. There are two reasons for this:

  1. Naive Bayes is a simpler example of a generative probabilistic model

What are Topic Models?

Latent Dirichlet Allocation is a topic model. A topic model simply allows us to discover the abstract topics that occur in a collection of documents.

Explaining Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA in the rest of this post) allows us to group observations into unobserved (latent) groups. In topic modeling, the observations are the words in each document and the latent groups are the topics.
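
To make this concrete, here is a minimal sketch of the generative process LDA assumes. The vocabulary, topic count, and Dirichlet parameters below are toy choices made up for illustration, not anything estimated from data: draw a topic mixture for the document, then for each word slot pick a topic and then a word from that topic.

In [ ]:
import numpy as np

rng = np.random.RandomState(0)

vocab = ["pizza", "pasta", "goal", "score", "vote", "senate"]
n_topics = 3
n_words_in_doc = 10

# beta: a (n_topics x vocabulary) matrix; each row is a topic's
# distribution over words (rows sum to 1).
beta = rng.dirichlet(np.ones(len(vocab)), size=n_topics)

# 1. Draw the document's topic proportions theta ~ Dirichlet(alpha).
alpha = np.array([1.0, 1.0, 1.0])
theta = rng.dirichlet(alpha)

# 2. For each word slot, draw a topic z ~ Multinomial(theta),
#    then draw the word w ~ Multinomial(beta[z]).
doc = []
for _ in range(n_words_in_doc):
    z = rng.choice(n_topics, p=theta)
    w = rng.choice(len(vocab), p=beta[z])
    doc.append(vocab[w])

print(theta)  # the document's mixture over the 3 topics
print(doc)    # the generated "document"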

LDA in Python


In [10]:
# Load the rpy2 IPython extension so that %%R cells run R code.
%load_ext rpy2.ipython

In [12]:
%%R
# Quick check that the %%R magic is wired up correctly.
y <- 2

In [6]:
!pip install -U rpy2


You are using pip version 7.0.1, however version 7.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
Collecting rpy2
  Downloading rpy2-2.6.0.tar.gz (171kB)
    100% |████████████████████████████████| 172kB 1.6MB/s 
Requirement already up-to-date: six in /Users/Will/anaconda/envs/py34/lib/python3.4/site-packages (from rpy2)
Installing collected packages: rpy2
  Found existing installation: rpy2 2.5.6
    Uninstalling rpy2-2.5.6:
      Successfully uninstalled rpy2-2.5.6
  Running setup.py install for rpy2
Successfully installed rpy2-2.6.0

In [5]:
import rpy2
rpy2.__version__


Out[5]:
'2.5.6'

Scikit-learn does not implement LDA (at the time of writing), which is why this post drops down to R via rpy2.
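
As an aside, if you want to stay entirely in Python, the gensim library does implement LDA. It is not used in the rest of this post, which sticks with R; the sketch below is just for context, on a made-up toy corpus, and assumes gensim is installed.

In [ ]:
from gensim import corpora, models

# Toy corpus: each document is a list of tokens.
texts = [["pizza", "pasta", "pizza", "cheese"],
         ["goal", "score", "match", "goal"],
         ["vote", "senate", "election", "vote"]]

dictionary = corpora.Dictionary(texts)                  # word <-> id mapping
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
lda.print_topics()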

LDA with Yelp: An Example

This is based on the paper by McAuley & Leskovec (2013).

References

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
  • McAuley, J., & Leskovec, J. (2013). Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text. RecSys 2013.

LDA Notes

  • Is LDA only for text corpora?
    • No, but text corpora are the main application it was designed for.
  • LDA is often presented as an alternative to tf-idf (see the sketch after this list).
  • A disadvantage of tf-idf is that it provides a relatively small reduction in description length and reveals little in the way of inter- or intra-document statistical structure.
    • In other words, knowing that a term shows up more often in one document than in another tells us nothing about any underlying structure.
  • pLSI takes a step forward compared to LSI in that it models each word in a document as a draw from a mixture model.
  • The problem is that pLSI provides no probabilistic model at the level of documents.
    • This leads to overfitting, since the number of parameters grows with the number of training documents.
  • Both LSI and pLSI are based on the bag-of-words approach.
  • Other uses for LDA include collaborative filtering and content-based image retrieval.
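
Since the notes above contrast LDA with tf-idf, here is a quick sketch of what a tf-idf representation looks like, using scikit-learn's TfidfVectorizer on a made-up toy corpus: every document is still just a long vector of term weights, which is exactly the "reveals little structure" complaint.

In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the pizza and the pasta were great",
        "the team scored a late goal to win the match",
        "the senate will vote on the bill next week"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Each document becomes a row of term weights: one column per
# vocabulary word, with no notion of topics or document structure.
print(X.shape)                      # (n_documents, n_vocabulary_terms)
print(len(vectorizer.vocabulary_))  # size of the learned vocabulary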

Dirichlet Distribution

The Dirichlet distribution is parameterized by the number of categories, $K$, and a vector of concentration parameters, $\boldsymbol{\alpha}$.

It is a distribution over multinomials: each draw is a probability vector of length $K$.
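
In other words, every draw from a Dirichlet is itself a probability vector: $K$ non-negative numbers that sum to 1. A quick check with toy parameters:

In [ ]:
import numpy as np

# Three draws from a Dirichlet with K = 3 categories.
draws = np.random.dirichlet((2.0, 3.0, 4.0), size=3)
print(draws)              # each row is a probability vector
print(draws.sum(axis=1))  # each row sums to 1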

The craziness of LDA distributions

  • The Dirichlet distribution is a generalization of the beta distribution.
  • The beta distribution is itself a pretty crazy distribution:

To make the beta distribution extra confusing, its two parameters, $\alpha$ and $\beta$, are both abstract.

  • A 3-dimensional Dirichlet models the distribution over the proportions of three topics.

Fun facts about the beta distribution

  • It models the behavior of random variables limited to intervals of finite length, and is used in a wide variety of disciplines.
  • It is used in Bayesian inference as the conjugate prior for several other distributions (e.g., the Bernoulli and binomial).
  • It is defined on the interval [0, 1].
  • $\alpha$ and $\beta$ are concentration parameters: the smaller they are, the more sparse the distribution is (see the sketch after this list).
  • Some people claim that it is a better choice for modeling proportions.
  • The beta distribution is built for proportions, which is convenient.
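
To see what the concentration parameters do, here is a sketch using scipy.stats.beta (assuming scipy and matplotlib are available): small $\alpha$ and $\beta$ push the density out toward 0 and 1, while larger values concentrate it around the mean.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

x = np.linspace(0.01, 0.99, 200)
for a, b in [(0.5, 0.5), (2, 2), (10, 10)]:
    # beta.pdf(x, a, b): density of Beta(a, b) at x
    plt.plot(x, beta.pdf(x, a, b), label="a=%s, b=%s" % (a, b))
plt.legend()
_ = plt.title("Beta densities for different concentration parameters")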

Assumptions of LDA

  1. The dimensionality $k$ of the Dirichlet distribution (i.e., the number of topics) is known and fixed.
  2. Word probabilities are parameterized by a $k \times V$ matrix $\beta$, where $V$ is the vocabulary size (a toy example follows below).
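
For concreteness, here is a toy $\beta$ with $k = 2$ topics and $V = 4$ vocabulary words (the numbers are made up): each row is one topic's distribution over the vocabulary.

In [ ]:
import numpy as np

# Toy k x V word-probability matrix: k = 2 topics, V = 4 words.
beta = np.array([[0.70, 0.20, 0.05, 0.05],   # topic 1: mostly words 1-2
                 [0.05, 0.05, 0.45, 0.45]])  # topic 2: mostly words 3-4
print(beta.shape)        # (k, V)
print(beta.sum(axis=1))  # each row sums to 1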

LDA: Important pictures and formulas

Where (1) is the prior probability of the topic proportions $\theta$ given $\alpha$, i.e. the Dirichlet density:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1} \tag{1}$$

Beta Distribution for AB testing?

Proportions come from a finite number of Bernoulli trials, so they are not truly continuous. Because of this, the beta distribution is not really appropriate here.

The normal distribution works here because of the CLT. That is an interesting point.

Intuitive Interpretations of the Dirichlet parameters

  • A common prior is the symmetric Dirichlet distribution, where all of the concentration parameters are equal. This corresponds to having no prior preference for one component over another (see the sketch below).
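
A minimal sketch of that intuition with toy parameters: with a symmetric Dirichlet, small concentration values give sparse draws (most of the mass lands on one component), while large values give draws that are close to uniform.

In [ ]:
import numpy as np

np.random.seed(0)

# Symmetric Dirichlet: all concentration parameters equal.
sparse = np.random.dirichlet([0.1, 0.1, 0.1], size=3)    # small alpha -> sparse
uniformish = np.random.dirichlet([10, 10, 10], size=3)   # large alpha -> near-uniform

print(sparse)      # each row puts most of its mass on one component
print(uniformish)  # each row is close to (1/3, 1/3, 1/3)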

In [26]:
%pylab inline


Populating the interactive namespace from numpy and matplotlib

In [31]:
import numpy as np
# 50 draws from a Dirichlet(10, 5, 3); each draw is a vector of three
# proportions summing to 1 (think of cutting a string into three pieces).
s = np.random.dirichlet((10, 5, 3), 50).transpose()
plt.barh(range(50), s[0])
plt.barh(range(50), s[1], left=s[0], color='g')
plt.barh(range(50), s[2], left=s[0]+s[1], color='r')
_ = plt.title("Lengths of Strings")



In [32]:
[i.mean() for i in s]  # mean of each of the three pieces across the 50 draws


Out[32]:
[0.5670060910944642, 0.25410247786043716, 0.17889143104509869]

In [33]:
[i/18 for i in [10, 5, 3]]  # predicted means: alpha_i / sum(alpha), where sum(alpha) = 18


Out[33]:
[0.5555555555555556, 0.2777777777777778, 0.16666666666666666]

So essentially the $\alpha$ values give you the approximate ratio of each dimension: the mean of the $i$-th component of a Dirichlet is $\alpha_i / \sum_j \alpha_j$.
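
As a sanity check, scipy exposes the analytic mean directly (assuming a reasonably recent scipy, which added scipy.stats.dirichlet in 0.15):

In [ ]:
from scipy.stats import dirichlet

# Analytic mean of Dirichlet(10, 5, 3): alpha_i / sum(alpha)
dirichlet.mean([10, 5, 3])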

